GPT 5.4 AI News List | Blockchain.News

List of AI News about GPT 5.4

2026-04-12 09:58
Claude Mythos vs Opus 4.6 and GPT 5.4: Looped Language Model Breakthrough Dominates GraphWalks and SWE-bench – 2026 Analysis

According to @godofprompt on X, citing an analysis by Chris Hayduk and ByteDance’s paper Scaling Latent Reasoning via Looped Language Models, Claude Mythos may leverage looped transformer passes to refine latent reasoning before producing output, which would align with its outsized gains on graph-search tasks. Per the same thread, Mythos scores 80% on GraphWalks BFS versus 38.7% for Anthropic’s Opus 4.6 and 21.4% for GPT 5.4, exactly the area where ByteDance predicted looping would dominate. Mythos also posts 77.8% on SWE-bench Pro versus 53.4%, 97.6% on USAMO versus 42.3%, 59% on SWE-bench Multimodal versus 27.1%, and 87.3% on SWE-bench Multilingual versus 77.8%, indicating broad benefits in software reasoning and multimodal code tasks. A token-efficiency chart cited in the thread shows Mythos reaching 86.9% on BrowseComp at 3M tokens, while Opus 4.6 needs 10M+ tokens to reach 74%, suggesting internal latent computation reduces token usage compared with explicit chain-of-thought. These third-party claims, sourced to @godofprompt’s X posts referencing Chris Hayduk’s thread and ByteDance’s research, imply material business impacts: lower inference token costs, higher accuracy in enterprise code automation, and competitive differentiation via architectural loops rather than larger parameter counts.
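The looped-pass idea can be made concrete with a short sketch. The PyTorch code below is a hypothetical, minimal rendering of what "looped transformer passes" could look like, not ByteDance's published architecture or anything confirmed about Mythos: one weight-tied block is applied several times to refine hidden states in latent space before any output tokens are decoded, which is the mechanism the thread credits for the token-efficiency gains.

```python
# Minimal sketch (not ByteDance's implementation): a "looped" transformer reuses
# one shared block for several latent refinement passes before decoding, instead
# of emitting explicit chain-of-thought tokens. All names and dims are illustrative.
import torch
import torch.nn as nn

class LoopedEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        # One weight-tied block, applied n_loops times (parameter count stays flat).
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Each pass refines the latent state instead of emitting reasoning tokens.
        for _ in range(self.n_loops):
            h = self.block(h)
        return h

# Usage: refine a batch of 8 sequences of length 16 through 4 latent passes.
h = torch.randn(8, 16, 256)
refined = LoopedEncoder()(h)
print(refined.shape)  # torch.Size([8, 16, 256])
```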

2026-04-08 16:36
Meta Unveils Muse Spark: Multimodal Reasoning Model With Contemplating Mode—Benchmark Analysis and 2026 Business Impact

According to The Rundown AI on X, Meta released Muse Spark, the first model from its Superintelligence Labs led by Alexandr Wang, featuring native multimodality, tool use, visual chain of thought, and a Contemplating mode that coordinates parallel agent reasoning. Per the same report, Muse Spark scores 50.2 on Humanity's Last Exam (no tools), surpassing Gemini 3.1 Deep Think at 48.4 and GPT 5.4 Pro at 43.9, and achieves 38.3 on FrontierScience Research, nearly double Gemini Deep Think's 23.3. Meta also disclosed gaps where Muse Spark trails: 42.5 on ARC AGI 2 versus Gemini's 76.5, and 59.0 on Terminal-Bench 2.0 versus GPT's 75.1. The Rundown AI adds that the model shows strong health reasoning aligned with Meta's personal-superintelligence strategy and was built in nine months after a ground-up AI stack rebuild, with potential distribution across Meta's 3.5B daily users to elevate assistant quality and agentic workflows.
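The post does not describe how Contemplating mode is implemented. One plausible reading of "coordinates parallel agent reasoning" is a fan-out-and-aggregate pattern such as self-consistency voting; the sketch below illustrates only that generic pattern, with a stubbed agent standing in for real model calls, and should not be taken as Meta's design.

```python
# Hypothetical sketch of "parallel agent reasoning": fan out N independent
# reasoning attempts concurrently, then aggregate by majority vote.
# This is a generic self-consistency pattern, not Meta's Contemplating mode.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def reasoning_agent(question: str, seed: int) -> str:
    # Stand-in for one model call; a real agent would query an LLM here.
    rng = random.Random(seed)
    return rng.choice(["42", "42", "41"])  # noisy but mostly correct

def contemplate(question: str, n_agents: int = 8) -> str:
    # Run agents in parallel and return the most common answer.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: reasoning_agent(question, s),
                                range(n_agents)))
    return Counter(answers).most_common(1)[0][0]

print(contemplate("What is 6 * 7?"))  # "42" on most runs
```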

2026-03-29 19:21
SlopCodeBench Analysis: Wisconsin and MIT Expose AI Coding Benchmark Failures with 11 Models, 93 Checkpoints, and 0 End-to-End Solves

According to God of Prompt on X, researchers from the University of Wisconsin and MIT introduced SlopCodeBench, showing that pass-rate-focused AI coding benchmarks miss structural decay in iterative software development: across 11 models including Claude Opus 4.6 and GPT 5.4, zero models solved a problem end to end, and verbosity rose in 89.8% of trajectories. Per the same thread, SlopCodeBench uses 20 problems and 93 checkpoints, forcing models to extend their own prior code against updated specs, which reveals rising cyclomatic complexity and duplicated scaffolds even when tests continue to pass. Agent erosion measured 0.68 versus 0.31 for human-maintained repos, agent verbosity 0.32 versus 0.11 for humans, costs grew 2.9x without correctness gains, and the highest strict solve rate across models was 17.2%. Anti-slop prompting reduced initial verbosity by 34.5% on GPT 5.4 but did not change the degradation slope, implying architectural incentives drive local optimizations that accumulate complexity. The findings highlight business risks for AI code assistants and the need for benchmarks that measure maintainability, extensibility, and lifecycle cost.
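To make the "rising cyclomatic complexity" signal concrete, the sketch below shows one way such a metric could be tracked across checkpoints using the open-source radon library; the checkpoint snippets and the per-checkpoint averaging are illustrative assumptions, not SlopCodeBench's actual harness.

```python
# Hypothetical sketch of the kind of structural-decay metric the thread describes:
# average cyclomatic complexity per checkpoint of a model-maintained file.
# Uses the real `radon` library (pip install radon); the snippets are made up.
from radon.complexity import cc_visit

checkpoints = [
    # checkpoint 1: simple solution
    "def solve(x):\n    return x * 2\n",
    # checkpoint 2: same behavior, extra branching accumulated while extending
    "def solve(x):\n    if x > 0:\n        return x * 2\n"
    "    elif x == 0:\n        return 0\n    else:\n        return -x * 2\n",
]

for i, src in enumerate(checkpoints, 1):
    blocks = cc_visit(src)  # one block per function/class in the source
    avg = sum(b.complexity for b in blocks) / len(blocks)
    print(f"checkpoint {i}: avg cyclomatic complexity = {avg:.1f}")
# Rising averages across checkpoints, while tests still pass, would signal the
# complexity accumulation that pass-rate-only benchmarks miss.
```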

2026-03-12 05:26
OpenClaw 2026.3.11 Release: Free 1M-Context Models via OpenRouter, GPT 5.4 Fix, Gemini Embedding 2, Go Support, and Security Hardening

According to @openclaw on X and the project's GitHub release notes, OpenClaw's 2026.3.11 release ships Hunter and Healer Alpha with free 1M-token-context models available through OpenRouter, enabling ultra-long-context retrieval and RAG use cases at zero cost for developers. The update integrates Gemini Embedding 2 for improved long-term memory indexing, boosting vector-search quality and memory recall in production pipelines, and tunes GPT 5.4 behavior to prevent mid-thought stopping, reducing truncation issues in agent loops and code-gen tasks. Per the same release notes, OpenCode adds Go language support, expanding automated code assistance and test generation beyond Python and JS, while a dedicated security-hardening sprint addresses dependency pinning, secret scanning, and sandbox tightening for safer model tooling. For businesses, OpenClaw says these changes lower LLM context costs, improve retrieval accuracy, and accelerate multi-language developer workflows, creating opportunities to build durable-memory agents and long-document analytics on top of OpenRouter and Gemini Embedding 2.
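For developers who want to try the long-context path, OpenRouter exposes an OpenAI-compatible API at openrouter.ai/api/v1, so the standard openai Python client works against it as sketched below. The model slug is a made-up placeholder, since the release notes quoted here do not name the exact free 1M-context model identifiers.

```python
# Sketch of calling a long-context model through OpenRouter's OpenAI-compatible
# API. The base URL is OpenRouter's real endpoint; the model slug below is a
# hypothetical placeholder, not a confirmed identifier from the release notes.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

with open("long_report.txt") as f:
    document = f.read()  # can approach 1M tokens on a 1M-context model

resp = client.chat.completions.create(
    model="openclaw/hunter-alpha:free",  # hypothetical slug for illustration
    messages=[
        {"role": "system", "content": "Answer strictly from the document."},
        {"role": "user", "content": f"{document}\n\nQ: Summarize the key risks."},
    ],
)
print(resp.choices[0].message.content)
```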

2026-03-07 02:34
LLM Fiction Benchmark Analysis: Why GPT 5.4 Pro, Claude, and Gemini 3.1 Pro Still Struggle With 10-Paragraph Mystery Writing

According to Ethan Mollick on Twitter, a 10-paragraph murder-mystery benchmark exposes planning, clue-calibration, and narrative-consistency failures across leading LLMs: Claude omits key clues, GPT 5.4 Pro over-signals solutions, and Gemini 3.1 Pro mis-explains an ice-based twist. Mollick notes the task requires front-loading solvable but subtle evidence within the first five paragraphs while maintaining suspense, a structure that stresses multi-step narrative planning and constraint tracking in LLMs. For businesses deploying generative writing, the findings indicate risks in long-form content generation where hidden constraints matter, such as compliance narratives, educational case studies, and interactive fiction, highlighting the need for structured outline enforcement, tool-driven plot graphs, and post-hoc validation chains.
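A "post-hoc validation chain" for this task could be as simple as mechanical checks over the generated draft. The sketch below encodes the two constraints the benchmark stresses, front-loaded clues and a withheld solution, as illustrative rules; the function, rules, and thresholds are hypothetical, not Mollick's methodology.

```python
# Hypothetical sketch of a post-hoc validation chain for constrained fiction:
# after generation, check structural constraints the thread says models violate.
# Rule names and thresholds are illustrative, not Mollick's benchmark harness.
def validate_mystery(story: str, clues: list[str], culprit: str) -> list[str]:
    paragraphs = [p for p in story.split("\n\n") if p.strip()]
    errors = []
    if len(paragraphs) != 10:
        errors.append(f"expected 10 paragraphs, got {len(paragraphs)}")
    first_five = " ".join(paragraphs[:5]).lower()
    # Clue calibration: every planted clue must appear in the first 5 paragraphs.
    for clue in clues:
        if clue.lower() not in first_five:
            errors.append(f"clue not front-loaded: {clue!r}")
    # Over-signaling: the culprit must not be named before the final paragraph.
    if culprit.lower() in " ".join(paragraphs[:-1]).lower():
        errors.append("solution revealed before the final paragraph")
    return errors  # empty list means the draft passes; else regenerate or repair

story = "\n\n".join(f"Paragraph {i}. The melted ice..." for i in range(1, 11))
print(validate_mystery(story, clues=["melted ice"], culprit="the butler"))  # []
```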
